Crash course in R

Module 1: Portfolio Analysis with R

Prof. Matthew G. Son

University of South Florida

Demonstration Helper

Extensions: collaboration tools

We are using the OCT (Open Collaboration Tools) extension for demonstrations.

  • Check out the installation guide

You’ll be able to browse my working directory / files freely.

  • I’ll make it read-only, but you’ll be able to copy and paste to yours

Keybindings

Some essential keybindings may not be available in VSCode.

You can copy and paste them into your settings.json file.

Financial Data Analysis

tidyverse (dplyr) package

Grammar of data manipulation

  • A package for manipulating data in R

    • Especially designed for data.frame object

    • Greatly simplifies existing functionality in base R

  • Makes workflow intuitive, readable

  • Also fast, as key operations are written in C++

    • The tidyverse includes the dplyr package

dplyr grammar

Key verbs

  • Verbs for the core functionality
  • These verbs are used with the pipe operator |>
  1. select() : select subset of columns
    • rename() : rename columns
    • relocate() : change column positions
  2. filter() : select subset of rows with condition
  3. arrange() : reorder rows
  4. mutate() : add new columns (variables)
  5. summarize() : generate summary table based on group_by()
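These verbs chain together with R's native pipe |> (available in R 4.1+), which passes the left-hand result as the first argument of the next call. A minimal base-R sketch:

```r
# x |> f(y) is just another way of writing f(x, y)
piped   <- iris |> head(2)
classic <- head(iris, 2)
identical(piped, classic)  # TRUE
```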

dplyr::select()

select() works on columns (variables).

Example 1:

iris |>
  select(Sepal.Length, Petal.Width, Species) |>
  head(3)
  Sepal.Length Petal.Width Species
1          5.1         0.2  setosa
2          4.9         0.2  setosa
3          4.7         0.2  setosa

Example 2:

Select from “Sepal.Length” (1st) to “Petal.Length” (3rd) column

iris |> select(Sepal.Length:Petal.Length) |> head(3)
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
iris |> select(1:3) |> head(3) # the same
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3

Example 3:

Select all columns except the specified ones

iris |>
  select(!(Sepal.Length:Petal.Length)) |>
  head(3)
  Petal.Width Species
1         0.2  setosa
2         0.2  setosa
3         0.2  setosa

Convenience functions

Convenience helper functions can be used within select().

Examples:

  • starts_with(), ends_with(), contains()

  • matches(), num_range(), all_of(), any_of(), everything(), last_col(), where()

  • …and lots more! Check the tidyselect documentation

iris |>
  select(starts_with("Sepal")) |>
  head(3)
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
iris |>
  select(ends_with("Length")) |>
  head(3)
  Sepal.Length Petal.Length
1          5.1          1.4
2          4.9          1.4
3          4.7          1.3
iris |>
  select(contains("al")) |>
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2

dplyr::rename()

iris |>
  rename(sepal_len = Sepal.Length) |> # change variable name
  head(3)
  sepal_len Sepal.Width Petal.Length Petal.Width Species
1       5.1         3.5          1.4         0.2  setosa
2       4.9         3.0          1.4         0.2  setosa
3       4.7         3.2          1.3         0.2  setosa

dplyr::relocate()

Rearrange the column order.

Unmentioned columns retain their original order.

iris |> 
  relocate(Species, ends_with('Length')) |> 
  head(3)
  Species Sepal.Length Petal.Length Sepal.Width Petal.Width
1  setosa          5.1          1.4         3.5         0.2
2  setosa          4.9          1.4         3.0         0.2
3  setosa          4.7          1.3         3.2         0.2

dplyr::filter()

filter() subsets rows based on conditions.

Example 1: Subset rows with conditions

diamonds |> 
  filter(carat > 0.23) |> 
  head(3)
# A tibble: 3 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
2  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Example 2: Multiple conditions

diamonds |> 
  filter(carat < 0.28 & carat > 0.23, # use & for AND
         cut == 'Good') |> # comma-separated conditions are also combined with AND
  head(3)
# A tibble: 3 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.26 Good  D     VS2      65.2    56   403  3.99  4.02  2.61
2  0.26 Good  D     VS1      58.4    63   403  4.19  4.24  2.46
3  0.26 Good  E     VVS1     57.9    60   554  4.22  4.25  2.45
  • cf. To filter rows by row number, use slice()
diamonds |> 
  slice(1, 3) # take the first and third rows
# A tibble: 2 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31

Class Exercise

From the diamonds dataframe (load tidyverse for access):

  1. Subset the dataframe with the conditions below:

    1. carat is equal to 0.26 and;

    2. clarity is “VS2” and;

    3. select columns “carat”, “cut”, “clarity”

    4. then store it as sub_diamonds

  2. What is the dimension of sub_diamonds? Check with dim(sub_diamonds)

dplyr::arrange()

Arrange (sort) the dataframe by the specified column's values.

iris |>
  arrange(Sepal.Length) |> # ascending by default
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          4.3         3.0          1.1         0.1  setosa
2          4.4         2.9          1.4         0.2  setosa
3          4.4         3.0          1.3         0.2  setosa
  • Can be arranged in descending order
iris |>
  arrange(-Sepal.Length) |> # or, arrange(desc(Sepal.Length))
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1          7.9         3.8          6.4         2.0 virginica
2          7.7         3.8          6.7         2.2 virginica
3          7.7         2.6          6.9         2.3 virginica

dplyr::mutate()

Compute transformations of variables and create new columns.

iris |> 
  mutate(new_length = Sepal.Length + Petal.Length) |> 
  head(3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_length
1          5.1         3.5          1.4         0.2  setosa        6.5
2          4.9         3.0          1.4         0.2  setosa        6.3
3          4.7         3.2          1.3         0.2  setosa        6.0

Example: Z-score standardizing

\[ \frac{X - \bar{X}}{\sigma_X} = \frac{X - mean(X)}{sd(X)} \]

iris |>
  mutate(
    std_sepal_length = (Sepal.Length - mean(Sepal.Length)) / sd(Sepal.Length)
  ) |>
  select(Sepal.Length, std_sepal_length) |>
  head(3)
  Sepal.Length std_sepal_length
1          5.1       -0.8976739
2          4.9       -1.1392005
3          4.7       -1.3807271

dplyr::summarize()

Many times we are interested in getting summary statistics for groups.

  • Summarizing is also called data aggregation

  • Often used with group_by() to generate grouped summaries

  1. Example: What is the average of “Sepal.Length” by “Species” in the iris dataset?
# Let's browse iris data
iris |> 
  group_by(Species) |> 
  slice(1:3) # show the first 3 rows from each Species
# A tibble: 9 × 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5.1         3.5          1.4         0.2 setosa    
2          4.9         3            1.4         0.2 setosa    
3          4.7         3.2          1.3         0.2 setosa    
4          7           3.2          4.7         1.4 versicolor
5          6.4         3.2          4.5         1.5 versicolor
6          6.9         3.1          4.9         1.5 versicolor
7          6.3         3.3          6           2.5 virginica 
8          5.8         2.7          5.1         1.9 virginica 
9          7.1         3            5.9         2.1 virginica 

The code above shows the first 3 rows for each Species.

iris |> 
  group_by(Species) |> 
  summarize(avg_sepal_length = mean(Sepal.Length)) 
# A tibble: 3 × 2
  Species    avg_sepal_length
  <fct>                 <dbl>
1 setosa                 5.01
2 versicolor             5.94
3 virginica              6.59

dplyr::summarize()

What if we want to summarize across every column?

  • Example: Calculate the average across all columns for each group
iris |>
  group_by(Species) |>
  summarize(across(everything(), mean))
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33 
3 virginica          6.59        2.97         5.55       2.03 
iris |> 
  group_by(Species) |> 
  summarize(across(ends_with("Length"), mean))
# A tibble: 3 × 3
  Species    Sepal.Length Petal.Length
  <fct>             <dbl>        <dbl>
1 setosa             5.01         1.46
2 versicolor         5.94         4.26
3 virginica          6.59         5.55

Lab Exercise

From diamonds dataframe:

  1. How many observations (rows) have carat value greater than 3.1?
  2. What is the average price of diamonds whose carat equals 1.0?
    • Use filter() and summarize()
  3. What are the minimum, average, and maximum of price by cut?
    • Use min() and max()
  4. How many observations are found by each cut?
    • Use n()
  5. Arrange the dataframe by carat (descending) and then price (ascending).
  6. Mutate a new column, named xyz, which is the multiplication of x, y, and z. Store the dataframe as my_diamond.
  7. What is the maximum value of xyz (max(my_diamond$xyz))?
  8. What is the correlation between carat and price by each cut?
    • Use cor(x,y) for correlation between x and y

Data Tidying

Tidy data

Real-world data does not come clean and ready for analysis.

You will learn a consistent data structure: tidy data.

Data examples

The same data can be presented in various ways. The example data carries 4 pieces of information:

  • country
  • year
  • population
  • Number of TB (tuberculosis) cases

The data (table1, table2, table3) is available when you load tidyverse.

Example 1

country year cases population
Afghanistan 1999 745 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

Example 2

country year type count
Afghanistan 1999 cases 745
Afghanistan 1999 population 19987071
Afghanistan 2000 cases 2666
Afghanistan 2000 population 20595360
Brazil 1999 cases 37737
Brazil 1999 population 172006362
Brazil 2000 cases 80488
Brazil 2000 population 174504898
China 1999 cases 212258
China 1999 population 1272915272
China 2000 cases 213766
China 2000 population 1280428583

Example 3

country year rate
Afghanistan 1999 745/19987071
Afghanistan 2000 2666/20595360
Brazil 1999 37737/172006362
Brazil 2000 80488/174504898
China 1999 212258/1272915272
China 2000 213766/1280428583

Tidy data

Which data structure is easier to work with, for general purpose?

  • table1 is Tidy data.

Tidy data

A definition of tidy data:

  1. Each variable is a column;
  2. Each observation is a row;
  3. Each value is a cell; each cell is a single value.

Tidy data

When data is tidy, it is generally easier to work with the data.

  • However, sometimes you’ll need to pivot (reshape) the data for your analysis.

Example

If you want to generate a rate of TB per 10,000 population on table1:

table1 |>
  mutate(rate = cases / population * 10000) |>
  print()
# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <dbl>  <dbl>      <dbl> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67 

This operation is not simple with the table2 layout.

table2 # where should rate variable be located??
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Quick exercise

Load package tidyverse.

table1 will be ready for access.

  1. What is the average population of each country? (Hint: group_by() and summarize())
  2. What is the total sum of TB cases across all years for each country?
  3. What is the average ratio (in percent) of TB cases to population for each country?

Answers

1.

library(tidyverse)
table1 |> 
  group_by(country) |> 
  summarize(avg_pop = mean(population))
# A tibble: 3 × 2
  country         avg_pop
  <chr>             <dbl>
1 Afghanistan   20291216.
2 Brazil       173255630 
3 China       1276671928.

2.

table1 |> 
  group_by(country) |> 
  summarize(total_cases = sum(cases))
# A tibble: 3 × 2
  country     total_cases
  <chr>             <dbl>
1 Afghanistan        3411
2 Brazil           118225
3 China            426024

3.

table1 |> 
  mutate(TBratio = cases/population) |> 
  group_by(country) |> 
  summarize(avg_rate_pct = mean(TBratio) * 100)
# A tibble: 3 × 2
  country     avg_rate_pct
  <chr>              <dbl>
1 Afghanistan      0.00834
2 Brazil           0.0340 
3 China            0.0167 

Pivot data

For your analysis, you will need to pivot the data (aka reshape):

  • to longer form (fewer variables and more observations)

  • or to wider form (more columns and fewer rows)

Tip

Whenever you pivot the data, think about columns that are affected, and the names and values.

Billboard example

Let’s take a look at the billboard dataset, which is in wide form.

Pivot longer

Each observation is a song, and 76 columns (wk1 to wk76) describe its weekly rank.

To tidy up, we want a “rank” variable: one column that stores the number.

How can we pivot the data so that we have rank in one column?

A simple illustration of how pivot_longer() works:
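A sketch of the call that produces the long table below, assuming the tidyverse (or at least tidyr) is loaded:

```r
library(tidyr)

# Gather the 76 "wk" columns into a name column (week) and a value column (rank)
billboard_longer <- billboard |>
  pivot_longer(
    cols = starts_with("wk"), # wk1 ... wk76
    names_to = "week",        # old column names become values of `week`
    values_to = "rank"        # cell contents become values of `rank`
  )
```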

Pivot Billboard data

artist track date.entered week rank
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA

Exercise

  1. Load billboard data with
data(billboard)
print(billboard)
  2. Select artist, track, and the columns from “wk1” to “wk12”
  • Use num_range() in select()
  • Browse ?num_range
  3. Pivot the data to longer form.

Pivot data: wider

pivot_wider() works in the opposite direction:

  • increases the number of columns (variables)

  • decreases the number of rows (observations)

Wider forms are common for machine learning.

  • One hot encoding / dummy variables
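As a base-R illustration of one-hot encoding, model.matrix() can expand a factor into one indicator column per level:

```r
# "~ Species - 1" drops the intercept so every level gets its own dummy column
dummies <- model.matrix(~ Species - 1, data = iris)
head(dummies, 2)
#   Speciessetosa Speciesversicolor Speciesvirginica
# 1             1                 0                0
# 2             1                 0                0
```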

Example

Previous billboard_longer data:

head(billboard_longer, 5)
# A tibble: 5 × 5
  artist track                   date.entered week   rank
  <chr>  <chr>                   <date>       <chr> <dbl>
1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87

Pivot the data to wider form:

billboard_longer |>
  pivot_wider(
    names_from = week, # must be provided
    values_from = rank # must be provided
  ) |>
  head(5)
# A tibble: 5 × 79
  artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
  <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Note that pivot_wider() requires at least two inputs: names_from and values_from.

Financial Data Example

Stock price data commonly comes in the form below:

symbol date open high low close volume adjusted
TSLA 2023-01-03 118.47 118.80 104.64 108.10 231402800 108.10
TSLA 2023-01-04 109.11 114.59 107.52 113.64 180389000 113.64
TSLA 2023-01-05 110.51 111.75 107.16 110.34 157986300 110.34
TSLA 2023-01-06 103.00 114.39 101.81 113.06 220911100 113.06
TSLA 2023-01-09 118.96 123.52 117.11 119.77 190284000 119.77
TSLA 2023-01-10 121.07 122.76 114.92 118.85 167642500 118.85

Question: What if you wanted to have a column for each stock’s adjusted price, like below?

date TSLA BAC XOM
2023-01-03 108.10 30.97427 96.07030
2023-01-04 113.64 31.55659 96.34991
2023-01-05 110.34 31.49190 98.50566
2023-01-06 113.06 31.80616 99.69628
2023-01-09 119.77 31.32551 97.83819
2023-01-10 118.85 31.53811 99.29941
stock_prices |> 
  select(symbol, date, adjusted) |> 
  pivot_wider(
    names_from = symbol,
    values_from = adjusted)

Class Exercise

  1. Examine the data structure of table1 and table2, available from tidyverse.

How can you transform table1 to table2?

Answer

table1 |>
  pivot_longer(
    cols = c(cases, population), # multiple selection: use c()
    names_to = 'type',
    values_to = 'count'
  )
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Lab problem

Examine the dataset population, available from tidyverse.

Pivot the data to generate output as below.

(Hint: Browse document file ?pivot_wider() and see names_prefix argument.)

country year_1995 year_1996 year_1997 year_1998 year_1999 year_2000 year_2001 year_2002 year_2003 year_2004 year_2005 year_2006 year_2007 year_2008 year_2009 year_2010 year_2011 year_2012 year_2013
Afghanistan 17586073 18415307 19021226 19496836 19987071 20595360 21347782 22202806 23116142 24018682 24860855 25631282 26349243 27032197 27708187 28397812 29105480 29824536 30551674
Albania 3357858 3341043 3331317 3325456 3317941 3304948 3286084 3263596 3239385 3216197 3196130 3179573 3166222 3156608 3151185 3150143 3153883 3162083 3173271
Algeria 29315463 29845208 30345466 30820435 31276295 31719449 32150198 32572977 33003442 33461345 33960903 34507214 35097043 35725377 36383302 37062820 37762962 38481705 39208194
American Samoa 52874 53926 54942 55899 56768 57522 58176 58729 59117 59262 59117 58652 57919 57053 56245 55636 55274 55128 55165
Andorra 63854 64274 64090 63799 64084 65399 68000 71639 75643 79060 81223 81877 81292 79969 78659 77907 77865 78360 79218
Angola 12104952 12451945 12791388 13137542 13510616 13924930 14385283 14886574 15421075 15976715 16544376 17122409 17712824 18314441 18926650 19549124 20180490 20820525 21471618
Anguilla 9807 10063 10305 10545 10797 11071 11371 11693 12023 12342 12637 12903 13145 13365 13571 13768 13956 14132 14300
Antigua and Barbuda 68349 70245 72232 74206 76041 77648 78972 80030 80904 81718 82565 83467 84397 85349 86300 87233 88152 89069 89985
Argentina 34833168 35264070 35690778 36109342 36514558 36903067 37273361 37627545 37970411 38308779 38647854 38988923 39331357 39676083 40023641 40374224 40728738 41086927 41446246
Armenia 3223173 3173425 3137652 3112958 3093820 3076098 3059960 3047002 3036032 3025652 3014917 3002911 2989882 2977488 2968154 2963496 2964120 2969081 2976566

Financial Data

Data in Finance

Financial data encompasses information related to financial markets, instruments, and economic indicators. It includes:

  • Stock prices
  • Bonds and interest rates
  • Commodities prices
  • Real estate prices
  • Exchange rates
  • Financial statements

Types of Financial Data

  1. Time Series Data: Sequential data points over time, e.g., daily stock prices.
  2. Cross-sectional Data: Data at a single point in time across many entities, e.g., balance sheets of various companies.
  3. Panel Data: Combines time series and cross-sectional data, e.g., yearly revenue of several companies over a decade.

Financial Data Sources 1

  • Exchanges: NYSE, NASDAQ
  • Government Publications: FRED, U.S. Treasury, Economic reports
  • Financial News Outlets: Bloomberg, Reuters
  • Data Providers: Yahoo Finance, Quandl (NASDAQ Data Link)

Financial Data Sources 2

  • SEC EDGAR: U.S. corporate filings, public companies’ financials
  • World Bank: Global economic indicators, international trade
  • U.S. Census Bureau: Demographic information and social indicators
  • Macroeconomic Indicators:
    • Bureau of Economic Analysis (BEA): For U.S. economic accounts, including GDP and personal income.
    • Federal Reserve Economic Data (FRED): A comprehensive database of U.S. financial and economic data.
    • International Monetary Fund (IMF): For global financial stability reports, world economic outlooks, and international financial statistics.

Recap on Portfolio theory

Risk and Return

The risk-return tradeoff is a fundamental concept in finance

  • Higher Risk: Greater potential for return but higher chance of loss
  • Lower Risk: More predictable outcomes, but typically lower returns

Calculating Returns

Return Calculation for Individual Assets:

The holding period return of an asset (\(r\)) over a period is calculated using the formula:

\[ r = \frac{P_{end} - P_{begin} + D}{P_{begin}} \]

  • \(r\) is the rate of return for the holding period

  • \(P_{end}\) is the ending price of the asset

  • \(P_{begin}\) is the beginning price of the asset

  • \(D\) represents any dividends or income received during the period
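The formula translates directly into a small base-R function; the prices and dividend below are hypothetical:

```r
# Holding period return: price change plus income, relative to the starting price
holding_period_return <- function(p_begin, p_end, d = 0) {
  (p_end - p_begin + d) / p_begin
}

# Buy at 100, end at 105, receive a 2 dividend during the period
holding_period_return(100, 105, d = 2)  # 0.07, i.e., 7%
```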

Calculating Returns

Tip

  • We don’t need to account for \(D\) when we calculate returns based on the adjusted close price
  • Typically the holding period is set at a consistent frequency, such as minute, hourly, daily, weekly, or monthly.

Portfolio Return

The return on a portfolio is a weighted sum of the individual returns of the assets within the portfolio.

  • Portfolio Return Formula:

The return of a portfolio (\(r_p\)) is calculated as:

\[ r_p = \sum_{i=1}^{n} w_i r_i \]

  • \(r_p\) is the return of the portfolio over the period

  • \(w_i\) is the weight of asset \(i\) in the portfolio

  • \(r_i\) is the return of asset \(i\) over the period

  • \(n\) is the number of assets in the portfolio
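In base R the formula is a single weighted sum; the weights and returns below are made-up numbers:

```r
w <- c(0.5, 0.3, 0.2)     # portfolio weights (must sum to 1)
r <- c(0.10, 0.05, -0.02) # individual asset returns over the period
r_p <- sum(w * r)         # weighted sum: 0.05 + 0.015 - 0.004
r_p  # 0.061, i.e., 6.1%
```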

Analytical Exercise

Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.

If 60% of your portfolio is invested in asset A and 40% in asset B,

Calculate the expected annual return of the portfolio.

Stand-Alone Risk

Stand-alone risk considers the risk of a single asset independently.

  • Measured by the standard deviation of historical period returns

The formula for risk (\(\sigma\)) is:

\[ \sigma = \sqrt{\frac{\sum (r_t - \bar{r})^2}{T - 1}} = sd(r) \]

Where:

  • \(r_t\) are the returns in period \(t\)
  • \(\bar{r}\) is the average return
  • \(T\) is the number of periods
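The formula is exactly what sd() computes; a quick base-R check on made-up returns:

```r
r <- c(0.01, -0.02, 0.03, 0.015, -0.005)                # hypothetical period returns
manual <- sqrt(sum((r - mean(r))^2) / (length(r) - 1))  # the formula, by hand
all.equal(manual, sd(r))  # TRUE
```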

Portfolio Risk

Portfolio risk involves the risk associated with holding a portfolio of assets.

  • Not simply the weighted average of the stand-alone risks of portfolio assets
  • Influenced by correlation between returns of assets

Analytical Exercise

Suppose you have a portfolio consisting of two assets, A and B. The expected annual return for asset A is 8%, and for asset B, it’s 12%.

The annual standard deviation of returns for asset A is 10% (0.1), and for asset B, it’s 15% (0.15). The correlation coefficient between the returns of assets A and B is 0.5.

If 60% of your portfolio is invested in asset A and 40% in asset B,

Calculate the expected annual risk of the portfolio.

\[ \sigma_p^2 = w_A^2 \sigma_A^2 + w_B^2 \sigma_B^2 + 2 w_A w_B \sigma_A \sigma_B \rho_{AB} \]
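One way to check your answer: plug the exercise's numbers into the formula in base R.

```r
w_A <- 0.6;  w_B <- 0.4   # portfolio weights
s_A <- 0.10; s_B <- 0.15  # standard deviations of A and B
rho <- 0.5                # correlation between A and B

var_p <- w_A^2 * s_A^2 + w_B^2 * s_B^2 + 2 * w_A * w_B * s_A * s_B * rho
sd_p  <- sqrt(var_p)  # portfolio risk, about 0.104 (10.4% per year)
```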

Portfolio Risk

Empirical approach

At each time \(t\), calculate the return of a portfolio (\(r_p\))

\[ r_{p,t} = \sum_{i=1}^{n} w_{i,t} r_{i,t} \]

Then calculate standard deviation of portfolio returns (\(\sigma_p\)) over \(t\)

\[ \sigma_p = sd(r_p) = \sqrt{\frac{\sum (r_{p,t} - \bar{r_p})^2}{T - 1}} \]
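The two steps above in base R, using a small hypothetical return matrix (periods in rows, assets in columns) and fixed weights:

```r
# Three periods, two assets
R <- cbind(A = c(0.01, -0.02, 0.03),
           B = c(-0.01, 0.02, 0.00))
w <- c(0.5, 0.5)           # fixed weights

r_p <- as.vector(R %*% w)  # portfolio return at each t: weighted sum per row
sigma_p <- sd(r_p)         # portfolio risk: sd of the portfolio return series
```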

Walk-through Exercise

Step 1: Get stock prices

library(tidyverse)
library(tidyquant)
stock_prices <- tq_get(c('TSLA','BAC','XOM'), 
                      from = '2020-01-01', 
                      to = '2023-12-31')
stock_prices <- stock_prices |> 
    select(symbol, date, adjusted)
stock_prices |> head()
# A tibble: 6 × 3
  symbol date       adjusted
  <chr>  <date>        <dbl>
1 TSLA   2020-01-02     28.7
2 TSLA   2020-01-03     29.5
3 TSLA   2020-01-06     30.1
4 TSLA   2020-01-07     31.3
5 TSLA   2020-01-08     32.8
6 TSLA   2020-01-09     32.1

Step 2: Period returns

stock_returns <- stock_prices |>
  arrange(symbol, date) |>
  group_by(symbol) |>
  mutate(
    daily_return = adjusted / lag(adjusted) - 1,
  ) |>
  select(symbol, date, daily_return)

stock_returns <- stock_returns |> drop_na()

stock_returns |> head()
# A tibble: 6 × 3
# Groups:   symbol [1]
  symbol date       daily_return
  <chr>  <date>            <dbl>
1 BAC    2020-01-03     -0.0208 
2 BAC    2020-01-06     -0.00143
3 BAC    2020-01-07     -0.00660
4 BAC    2020-01-08      0.0101 
5 BAC    2020-01-09      0.00172
6 BAC    2020-01-10     -0.00828

Step 3: Pivot to Wide-form

We will use the pivot technique to move away from tidy form.

returns_wide <- stock_returns  |> 
  pivot_wider(names_from = symbol, 
              values_from = daily_return)
returns_wide |> head()
# A tibble: 6 × 4
  date            BAC     TSLA      XOM
  <date>        <dbl>    <dbl>    <dbl>
1 2020-01-03 -0.0208   0.0296  -0.00804
2 2020-01-06 -0.00143  0.0193   0.00768
3 2020-01-07 -0.00660  0.0388  -0.00818
4 2020-01-08  0.0101   0.0492  -0.0151 
5 2020-01-09  0.00172 -0.0219   0.00766
6 2020-01-10 -0.00828 -0.00663 -0.00889

Drop rows with missing values

returns_wide <- returns_wide |> 
  drop_na()

Step 4: Generate portfolio return

# Generate portfolio returns, then calculate the standard deviation
weights <- c(0.2, 0.3, 0.5)
returns_wide <- returns_wide |>
  rowwise() |>
  mutate(port_ret = sum(c_across(!date) * weights)) |>
  ungroup()
returns_wide |> head()
# A tibble: 6 × 5
  date            BAC     TSLA      XOM  port_ret
  <date>        <dbl>    <dbl>    <dbl>     <dbl>
1 2020-01-03 -0.0208   0.0296  -0.00804  0.000718
2 2020-01-06 -0.00143  0.0193   0.00768  0.00933 
3 2020-01-07 -0.00660  0.0388  -0.00818  0.00623 
4 2020-01-08  0.0101   0.0492  -0.0151   0.00924 
5 2020-01-09  0.00172 -0.0219   0.00766 -0.00241 
6 2020-01-10 -0.00828 -0.00663 -0.00889 -0.00809 

Step 5: Calculate Portfolio risk (sd)

portfolio_risk_m2 <- returns_wide |> 
    summarize(across(!date, sd))
portfolio_risk_m2
# A tibble: 1 × 4
     BAC   TSLA    XOM port_ret
   <dbl>  <dbl>  <dbl>    <dbl>
1 0.0241 0.0429 0.0235   0.0220

Recap on Capital Asset Pricing Model

CAPM

Derived from portfolio theory and optimization.

\[ E(r_i) = r_f + \beta_i \left( E(r_m) - r_f \right) \]

Where:

  • \(E(r_i)\) is the expected return of the investment

  • \(r_f\) is the risk-free rate

  • \(\beta_i\) is the beta of the investment

  • \(E(r_m)\) is the expected return of the market

  • \((E(r_m) - r_f)\) is known as the market risk premium
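The formula as base R arithmetic, with hypothetical inputs:

```r
r_f  <- 0.03  # risk-free rate
beta <- 1.2   # beta of the investment
E_rm <- 0.08  # expected market return

E_ri <- r_f + beta * (E_rm - r_f)  # risk premium scaled by beta, plus r_f
E_ri  # 0.09, i.e., an expected return of 9%
```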

Beta

Beta is a measure of the sensitivity of an individual investment’s returns to the market.

  • An indication of an asset’s exposure to systematic (market) risk

\[ \beta_i = \frac{Cov(r_i - r_f, r_m-r_f)}{Var(r_m-r_f)} \]

  • Same as the slope coefficient \(\beta_1\) from the simple linear regression below

\[ r_i - r_f = \beta_0 + \beta_1(r_m - r_f)+ e_i \]

Tip

In some specific cases (e.g., intraday frequency) beta is estimated with raw return (\(r_i\) and \(r_m\)), not excess returns (\(r_i - r_f\))
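A quick base-R simulation confirming that the covariance ratio and the regression slope agree (the return series are simulated, not real data):

```r
set.seed(1)
mkt_ex <- rnorm(250, mean = 0.0003, sd = 0.01)            # simulated market excess returns
stk_ex <- 0.0001 + 1.3 * mkt_ex + rnorm(250, sd = 0.005)  # stock with true beta 1.3

beta_cov <- cov(stk_ex, mkt_ex) / var(mkt_ex)     # covariance formula
beta_ols <- unname(coef(lm(stk_ex ~ mkt_ex))[2])  # slope from the regression
all.equal(beta_cov, beta_ols)  # TRUE: the two definitions coincide
```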

Alpha

  • Alpha (\(\alpha\)) is a financial metric indicating the extra return compared to the return of a benchmark index
  • In CAPM, \(\alpha\) is the intercept of the regression
  • A positive \(\alpha\) indicates outperformance, while a negative indicates underperformance

Walk-through Exercise

Prepare stock price data and a market index.

Then combine the two dataframes (bind them row-wise).

tickers <- c("MSFT")
stock_prices <- tq_get(tickers, from = '2020-01-01', to = '2023-12-31')
snp500 <- tq_get(
  "SP500",
  get = "economic.data",
  from = "2020-01-01",
  to = "2023-12-31"
)
snp500 <- snp500 |> rename(adjusted = price) # rename column for binding
prices <- bind_rows(stock_prices, snp500)
prices |> glimpse()
Rows: 2,049
Columns: 8
$ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT…
$ date     <date> 2020-01-02, 2020-01-03, 2020-01-06, 2020-01-07, 2020-01-08, …
$ open     <dbl> 158.78, 158.32, 157.08, 159.32, 158.93, 161.84, 162.82, 161.7…
$ high     <dbl> 160.73, 159.95, 159.10, 159.67, 160.80, 162.22, 163.22, 163.3…
$ low      <dbl> 158.33, 158.06, 156.51, 157.32, 157.95, 161.03, 161.18, 161.2…
$ close    <dbl> 160.62, 158.62, 159.03, 157.58, 160.09, 162.09, 161.34, 163.2…
$ volume   <dbl> 22622100, 21116200, 20813700, 21634100, 27746500, 21385000, 2…
$ adjusted <dbl> 152.5057, 150.6067, 150.9960, 149.6193, 152.0025, 153.9014, 1…

Pivot price data into wide form:

price_wide = prices |> 
  pivot_wider(names_from = symbol, 
              values_from = adjusted,
              id_cols = date)
price_wide |> head()
# A tibble: 6 × 3
  date        MSFT SP500
  <date>     <dbl> <dbl>
1 2020-01-02  153. 3258.
2 2020-01-03  151. 3235.
3 2020-01-06  151. 3246.
4 2020-01-07  150. 3237.
5 2020-01-08  152. 3253.
6 2020-01-09  154. 3275.

Generate stock and index returns

# Use of anonymous function to calculate return
returns_wide <- price_wide |>
  mutate(across(-date, \(x) {
    (x / lag(x)) - 1
  }))

Prepare the risk-free rate (use the 3-month Treasury bill rate)

risk_free_data = tq_get("DGS3MO", 
                        get = "economic.data", 
                        from = "2020-01-01", 
                        to = "2023-12-31")
risk_free_data |> head(3)
# A tibble: 3 × 3
  symbol date       price
  <chr>  <date>     <dbl>
1 DGS3MO 2020-01-01 NA   
2 DGS3MO 2020-01-02  1.54
3 DGS3MO 2020-01-03  1.52

The risk-free rate is quoted in percentage terms (%) and annualized.

Convert the rate to a daily level:

  • Use the 252-business-day assumption
  • Use simple division for the daily conversion: \(r_d = r_y/252\)
risk_free_converted <- risk_free_data |>
  mutate(rf = (price / 252) / 100)
risk_free_converted |> head()
# A tibble: 6 × 4
  symbol date       price         rf
  <chr>  <date>     <dbl>      <dbl>
1 DGS3MO 2020-01-01 NA    NA        
2 DGS3MO 2020-01-02  1.54  0.0000611
3 DGS3MO 2020-01-03  1.52  0.0000603
4 DGS3MO 2020-01-06  1.56  0.0000619
5 DGS3MO 2020-01-07  1.54  0.0000611
6 DGS3MO 2020-01-08  1.54  0.0000611

Convert to wide form:

risk_free_converted <- risk_free_converted |> 
  pivot_wider(id_cols= date,
              names_from = symbol,
              values_from = rf
              )
risk_free_converted |> head()
# A tibble: 6 × 2
  date           DGS3MO
  <date>          <dbl>
1 2020-01-01 NA        
2 2020-01-02  0.0000611
3 2020-01-03  0.0000603
4 2020-01-06  0.0000619
5 2020-01-07  0.0000611
6 2020-01-08  0.0000611

Join the two data frames column-wise:

capm_data <- returns_wide |>
  left_join(risk_free_converted, by = "date")
capm_data |> head(3)
# A tibble: 3 × 4
  date           MSFT    SP500    DGS3MO
  <date>        <dbl>    <dbl>     <dbl>
1 2020-01-02 NA       NA       0.0000611
2 2020-01-03 -0.0125  -0.00706 0.0000603
3 2020-01-06  0.00258  0.00353 0.0000619

Then generate excess returns:

capm_data <- capm_data |> 
  mutate(
    MSFT_exret = MSFT - DGS3MO,
    Mkt_exret = SP500 - DGS3MO)

CAPM Estimation

Use lm() to fit a linear regression.

capm_fit <- lm(MSFT_exret ~ Mkt_exret, capm_data)
print(capm_fit)

Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)

Coefficients:
(Intercept)    Mkt_exret  
  0.0005149    1.1718519  

To view a summary of the regression:

summary(capm_fit)

Call:
lm(formula = MSFT_exret ~ Mkt_exret, data = capm_data)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.068986 -0.006335 -0.000456  0.006333  0.076456 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0005149  0.0003655   1.409    0.159    
Mkt_exret   1.1718519  0.0251585  46.579   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01154 on 996 degrees of freedom
  (45 observations deleted due to missingness)
Multiple R-squared:  0.6854,    Adjusted R-squared:  0.6851 
F-statistic:  2170 on 1 and 996 DF,  p-value: < 2.2e-16

How do we access and extract the coefficient estimates?

# Remember that the model object is "list"
capm_fit$coefficients
 (Intercept)    Mkt_exret 
0.0005148839 1.1718518803 
capm_fit$coefficients |> typeof()
[1] "double"

To extract the beta estimate:

capm_fit$coefficients[[2]]
[1] 1.171852
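Equivalently, the base-R accessor coef() returns the same named vector; extracting by name is a bit more robust than extracting by position:

```r
# coef() is the standard accessor for fitted-model coefficients
coef(capm_fit)[["Mkt_exret"]]    # beta, extracted by name
coef(capm_fit)[["(Intercept)"]]  # alpha
```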

Discussions

Timeframe

A CAPM estimate (or any other estimate) depends heavily on the timeframe you choose.

Consider the following:

  • An estimate based on the most recent 1 month
  • An estimate based on the most recent 3 months
  • An estimate based on the most recent 12 months

Should they be similar? Usually not.

Exercise

Using our previous example, estimate the CAPM beta with the following:

  • Assume today is 2023-12-31.
  • Estimate based on the most recent 1 month of data
  • Estimate based on the most recent 3 months of data
  • Estimate based on the most recent 12 months of data

Rolling Windows

Given a set “lookback” period, the estimate should use only past information.

To capture timely information at each period, rolling regressions are often performed.
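A minimal sketch of a rolling-window beta, assuming the slider package is installed (zoo::rollapply or a plain loop would work just as well); the 60-day window length is an arbitrary choice for illustration:

```r
library(slider)  # assumption: slider package is installed

# slider iterates a data frame row-wise, so each window `w`
# passed to the function is a data frame of the most recent 60 rows
beta_60d <- slide_dbl(
  capm_data,
  \(w) coef(lm(MSFT_exret ~ Mkt_exret, data = w))[["Mkt_exret"]],
  .before = 59,       # current row plus 59 prior rows = 60-day lookback
  .complete = TRUE    # return NA until a full window is available
)
capm_data$beta_60d <- beta_60d
```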

Lab Problems

Problem 1

Generate stock prices with the code below.

stock_prices = tq_get(c('AAPL', 'MSFT'), from = '2020-01-01', to = '2023-12-31')

Using adjusted daily closing prices, calculate the expected (average) return and risk of:

  • an equal-weighted portfolio

  • a portfolio with 30% in Apple and 70% in Microsoft

  • a portfolio with 70% in Apple and 30% in Microsoft

Problem 2

Based on the portfolio return (70% in Apple and 30% in Microsoft), calculate the portfolio beta.

  • Use daily periodicity.
  • Use the 3-month Treasury rate as the risk-free rate.

Data I/O

Read and Write

  • Importing data is a crucial initial step in any data science project.

  • We will learn how to read external data into R as a data.frame object.

  • Also, how to write a data.frame from R to a local file.

CSV files

  • CSV: Comma-Separated Values

  • A plain, human-readable text data file

  • Minimalistic, widely used

  • Typically opened with Excel, but it is not an Excel file!

  • Since it is text, R tries to “guess” the type of each column when importing

CSV Example

A CSV representation of the iris data frame:

iris |> head(5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

When saved to a CSV file:

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa

CSV files

Write and read csv

Many packages support writing/reading CSV files:

  • base R (utils package): basic but slow
  • readr from tidyverse: fast, functional
  • vroom, data.table: extremely fast and functional

We will use the readr package and discuss I/O speed later.

CSV files

To write a data.frame to a csv file: write_csv()

# saves csv file on current working directory
iris |> write_csv('iris_file.csv')
write_csv(iris, 'iris_file.csv') # same

To read a .csv file to a data.frame: read_csv()

my_iris <- read_csv('iris_file.csv')
head(my_iris)
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 
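Since read_csv() guesses column types from the text, you can also pin them down explicitly with the col_types argument; for example, reading Species as a factor rather than a character:

```r
library(readr)

my_iris <- read_csv(
  'iris_file.csv',
  col_types = cols(
    Species = col_factor(),   # force factor instead of character
    .default = col_double()   # all remaining columns as numeric
  )
)
```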

You can specify a downloadable URL instead:

test_url <- read_csv(
  "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
)
test_url |> head()
# A tibble: 6 × 9
    mpg cylinders displacement horsepower weight acceleration model_year origin
  <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl>      <dbl> <chr> 
1    18         8          307        130   3504         12           70 usa   
2    15         8          350        165   3693         11.5         70 usa   
3    18         8          318        150   3436         11           70 usa   
4    16         8          304        150   3433         12           70 usa   
5    17         8          302        140   3449         10.5         70 usa   
6    15         8          429        198   4341         10           70 usa   
# ℹ 1 more variable: name <chr>

xlsx files

To read a .xlsx file to a data.frame: read_excel() from readxl package

exxon_statement <- readxl::read_excel(
  'data/financial-statements-exxon-mobil.xlsx',
  sheet = 1, # first sheet
  skip = 1
)
head(exxon_statement)
# A tibble: 6 × 11
  `in million USD`         `FY '09` `FY '10` `FY '11` `FY '12` `FY '13` `FY '14`
  <chr>                       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Net Income/Starting Line    19658    31398    42206    47681    33448    33615
2 Depreciation & Amortiza…    11917    14760    15583    15888    17182    17297
3 Non-Cash Items              -1474      129      262     3313      482     3110
4 Change in Working Capit…    -1663     2126    -2706   -10712    -6198    -8906
5 Cash from Operating Act…    28438    48413    55345    56170    44914    45116
6 Change in Fixed Assets …   -22491   -26871   -30975   -34271   -33669   -32952
# ℹ 4 more variables: `FY '15` <dbl>, `FY '16` <dbl>, `FY '17` <dbl>,
#   `FY '18` <dbl>

Other data formats

There are other common data formats:

  • ‘.dat’, ‘.sas7bdat’, ‘.dta’
  • ‘.rds’: R native
  • ‘.json’: flexible data structure (NoSQL)
  • ‘.parquet’,‘.feather’: columnar for big data storage
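The R-native .rds format round-trips any R object, including column types and factor levels, with no type guessing on the way back in:

```r
# Write any R object to a single .rds file (base R, no package needed)
saveRDS(iris, 'iris_file.rds')

# Read it back; the object is restored exactly
my_iris <- readRDS('iris_file.rds')
identical(my_iris, iris)  # TRUE
```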

Data cleaning

janitor package offers simple variable name cleaner: clean_names().

exxon_statement <- exxon_statement |>
  janitor::clean_names()
exxon_statement |> head()
# A tibble: 6 × 11
  in_million_usd   fy_09  fy_10  fy_11  fy_12  fy_13  fy_14  fy_15  fy_16  fy_17
  <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Net Income/Sta…  19658  31398  42206  47681  33448  33615  16551   8375  19848
2 Depreciation &…  11917  14760  15583  15888  17182  17297  18048  22308  19893
3 Non-Cash Items   -1474    129    262   3313    482   3110   -750  -5313  -7921
4 Change in Work…  -1663   2126  -2706 -10712  -6198  -8906  -3505  -3288  -1754
5 Cash from Oper…  28438  48413  55345  56170  44914  45116  30344  22082  30066
6 Change in Fixe… -22491 -26871 -30975 -34271 -33669 -32952 -26490 -16163 -15402
# ℹ 1 more variable: fy_18 <dbl>

Exercise 1 : Read

  1. Install and load new packages: janitor and readxl
  2. Download the Exxon Mobil data to your directory.
  3. Read the statement file with read_excel() and store as exxon_statement
  4. Clean the variable names with clean_names().

Exercise 2: Join

  1. Read exxon_categories.csv file with read_csv() as exxon_categories.
  2. Left join exxon_categories to exxon_statement as:
exxon_statement <- exxon_statement |>
  left_join(
    exxon_categories,
    # Your work here
  )

Exercise 3 : Cleaning and Tidying

  1. Rename variable in_million_usd to Account.
  2. Pivot to a long form and store as exxon_long with below arguments:
  • use starts_with()
  • names_to = “Year”
  • names_prefix = “fy_”
  • values_to = “M_USD”
exxon_long <- exxon_statement |>
  rename(Account = in_million_usd) |>
  pivot_longer(
    cols = starts_with("fy_"),
    names_prefix = "fy_",
    names_to = "Year",
    values_to = "M_USD"
  )

Grammar of Graphics (ggplot)

Case 1: Single Categorical variable

Categorical variables (factors) take a predefined set of values.

  • Ordered: size (large, medium, …), grades (A, A-, …)
  • Unordered: country, ethnicity

To visualize the distribution of a categorical variable, bar plots are often used.

geom_bar() and geom_col()

geom_bar() : when you need counts of a single category

geom_col() : when you supply the Y values yourself

modified_exxon <- read_csv("https://usfedu-my.sharepoint.com/:x:/g/personal/gson_usf_edu/ESS_E_ZEdlNEv7xiUJTcOLkBDcrCYPUtqVt3RpOcFPxSpA?download=1")

# Example 1: geom_bar()
modified_exxon |> 
  ggplot(aes(x = Category)) +
  geom_bar()

# Example 2: geom_col()
total_amt_bycat <- modified_exxon |> 
  summarize(Total_USD = sum(M_USD, na.rm = TRUE), .by = c(Category))


total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col()

Effective visualization

The end goal of the bar plot:

Adding color to Bar plots

To fill bars with color, use “fill”. For the border color, use “color”.

total_amt_bycat |>
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue")

total_amt_bycat |>
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black")

Scaling Y axis

To change scientific notation to a currency format, use the scales package.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    labels = scales::label_currency(prefix = "$", suffix = "T", scale = 0.001)
  )

More ticks

To add more breaks (ticks) on Y axis, control n.breaks in continuous scale.

total_amt_bycat |>
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  )

Axis Labels

To add axis labels and plot title, use labs() function.

total_amt_bycat |> 
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Revenue / Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Revenue / Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  )

Adding Themes

By default, ggplot uses a gray background. To change it, use one of the other built-in themes.

total_amt_bycat |>
  ggplot(aes(x = Category, y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Revenue / Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Revenue / Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  ) +
  theme_bw()

Reordering by value

To reorder a factor variable according to its value, apply fct_reorder() to the variable.

total_amt_bycat |> 
  ggplot(aes(x = fct_reorder(Category, -Total_USD) , y = Total_USD)) +
  geom_col(fill = "lightblue", color = "black") +
  scale_y_continuous(
    n.breaks = 6,
    labels = scales::dollar_format(prefix = "$", suffix = "T", scale = 0.001)
  ) +
  labs(
    title = "Exxon Mobil Total Expenditure",
    subtitle = "From Years 2009 - 2018",
    x = "",
    y = "Total Expenditure ($)",
    caption = "Source: Exxon Financial Statements"
  ) +
  theme_bw()

Case 2: Single Numeric variable

A numeric variable can take an infinite range of values.

  • In finance: risk and returns, ratios (e.g., ROA)

A commonly used visualization is the histogram.

  • A proper bin size is important for effective visualization.

An alternative for a numeric distribution is the density plot.

  • Uses kernel smoothing
  • The bandwidth (bw) choice is important
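A sketch of both options on the MSFT daily returns built in the CAPM section (assumes the returns_wide data frame from earlier):

```r
library(ggplot2)

# Histogram: `bins` controls granularity; too few hides shape, too many adds noise
returns_wide |>
  ggplot(aes(x = MSFT)) +
  geom_histogram(bins = 50, fill = "lightblue", color = "black")

# Density plot: `adjust` scales the default bandwidth (0.5 = half the default)
returns_wide |>
  ggplot(aes(x = MSFT)) +
  geom_density(adjust = 0.5)
```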

Case 3: Numeric + Categorical

Visualizing two (or more) variables together helps show the relationships between them.

  • Boxplot, violin, and density plots
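A minimal sketch with the built-in iris data (numeric Sepal.Length split by the categorical Species):

```r
library(ggplot2)

# Boxplot: median, quartiles, and outliers per category
iris |>
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()

# Violin plot: shows the full density shape of each group
iris |>
  ggplot(aes(x = Species, y = Sepal.Length)) +
  geom_violin()
```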

Case 4: Two categorical

Bar plots can be used to visualize two categorical variables.

  • Use the “fill” aesthetic to add one more dimension
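A sketch using the mpg data read earlier from the seaborn-data URL (origin and cylinders are columns of that file):

```r
library(ggplot2)

# Bars are stacked by default; position = "dodge" places groups side by side
test_url |>
  ggplot(aes(x = origin, fill = factor(cylinders))) +
  geom_bar(position = "dodge")
```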

Case 5: Two numerical

Scatterplots and regressions can effectively visualize relationships.
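For example, the CAPM relationship estimated earlier can be shown directly (assumes the capm_data frame from the CAPM section):

```r
library(ggplot2)

# Scatterplot with a fitted least-squares line and confidence band
capm_data |>
  ggplot(aes(x = Mkt_exret, y = MSFT_exret)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm")
```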

Case 6: Time series

Time series plots place the date variable on the X axis.

Line plots are most often used.
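A sketch on the prices data frame built earlier (MSFT and SP500 adjusted values):

```r
library(ggplot2)

# One line per symbol, date on the X axis
prices |>
  ggplot(aes(x = date, y = adjusted, color = symbol)) +
  geom_line()
```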

Time series: Candlestick

Candlestick price charts are also commonly used:
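tidyquant provides a geom_candlestick() layer for OHLC data; a minimal sketch on one month of the MSFT prices pulled earlier:

```r
library(tidyquant)
library(ggplot2)

stock_prices |>
  filter(symbol == "MSFT", date >= as.Date("2023-12-01")) |>
  ggplot(aes(x = date, open = open, high = high, low = low, close = close)) +
  geom_candlestick()
```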